A THEORETICAL COMPARISON OF THE DATA AUGMENTATION,
MARGINAL AUGMENTATION AND PX-DA ALGORITHMS

By James P. Hobert and Dobrin Marchev

The data augmentation (DA) algorithm is a widely used Markov
chain Monte Carlo (MCMC) algorithm that is based on a Markov
transition density of the form $p(x|x') = \int_Y f_{X|Y}(x|y)\,f_{Y|X}(y|x')\,dy$,
where $f_{X|Y}$ and $f_{Y|X}$ are conditional densities. The PX-DA and
marginal augmentation algorithms of Liu and Wu [J. Amer. Statist. 
Assoc. 94 (1999) 1264-1274] and Meng and van Dyk [Biometrika 86 
(1999) 301-320] are alternatives to DA that often converge much 
faster and are only slightly more computationally demanding. The 
transition densities of these alternative algorithms can be written in 
the form $p_R(x|x') = \int_Y \int_Y f_{X|Y}(x|y')\,R(y,dy')\,f_{Y|X}(y|x')\,dy$, where $R$
is a Markov transition function on $Y$. We prove that when $R$ satisfies
certain conditions, the MCMC algorithm driven by $p_R$ is at least as
good as that driven by p in terms of performance in the central limit 
theorem and in the operator norm sense. These results are brought 
to bear on a theoretical comparison of the DA, PX-DA and marginal 
augmentation algorithms. Our focus is on situations where the group 
structure exploited by Liu and Wu is available. We show that the 
PX-DA algorithm based on Haar measure is at least as good as any 
PX-DA algorithm constructed using a proper prior on the group. 

1. Introduction. 

1.1. Background. In statistical problems where there is a need to explore
an intractable density, $f_X(x)$, there is sometimes available a joint density
$f(x,y)$, on $X \times Y$ say, such that $\int_Y f(x,y)\,dy = f_X(x)$ and such that
simulating from the conditional densities, $f_{X|Y}(x|y)$ and $f_{Y|X}(y|x)$, is
straightforward. In such situations, one can apply the data augmentation (DA)
algorithm (Tanner and Wong [22]), which is a Markov chain Monte Carlo
(MCMC) algorithm based on the Markov transition density (Mtd) given by

(1) $p(x|x') = \int_Y f_{X|Y}(x|y)\,f_{Y|X}(y|x')\,dy$.

It is well known and easy to show that $p(x|x')$ is reversible with respect to
$f_X$, which implies that $f_X$ is an invariant density. Like its cousin, the EM
algorithm, the DA algorithm is considered a useful algorithm that sometimes
suffers from slow convergence.
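
To make the two-step structure of (1) concrete, the following Python sketch runs a DA chain for a toy target that is not taken from the paper: $(X,Y)$ is assumed to be standard bivariate normal with correlation rho, so that both conditionals are univariate normals. The function names are illustrative. For this toy model the $x$-chain is an AR(1) process with coefficient rho^2, which is why DA becomes slow when the two components are highly correlated.

```python
import random
import statistics


def da_chain(rho, n_iter, x0=0.0, seed=0):
    """Run the DA chain x' -> y -> x for a toy bivariate normal target.

    Assumption (illustrative, not from the paper): (X, Y) is standard
    bivariate normal with correlation rho, so
    Y | X = x ~ N(rho * x, 1 - rho^2) and X | Y = y ~ N(rho * y, 1 - rho^2).
    """
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x, draws = x0, []
    for _ in range(n_iter):
        y = rng.gauss(rho * x, sd)   # draw y from f_{Y|X}(. | x')
        x = rng.gauss(rho * y, sd)   # draw x from f_{X|Y}(. | y)
        draws.append(x)
    return draws


def lag1_autocorrelation(draws):
    """Sample lag-1 autocorrelation of a sequence."""
    mean = statistics.fmean(draws)
    num = sum((a - mean) * (b - mean) for a, b in zip(draws, draws[1:]))
    den = sum((a - mean) ** 2 for a in draws)
    return num / den


if __name__ == "__main__":
    draws = da_chain(rho=0.8, n_iter=20000)
    # For this toy model the value should be close to rho^2.
    print(lag1_autocorrelation(draws))
```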

The PX-DA algorithm (Liu and Wu [13]) and the closely related marginal 
augmentation (MA) algorithm (Meng and van Dyk [14]) are alternatives to 
DA that often converge much faster and are only slightly more computationally
demanding. The basic idea is to use $f(x,y)$ to create an entire family
of joint densities that all have $f_X$ as the $x$ marginal. Each member of this
family can be used to form a DA algorithm and the hope is that some of
the resulting algorithms will be significantly better than the original. To be
specific, consider a class of functions $t_g : Y \to Y$ for $g \in G$ such that, for each
fixed $g$, $t_g(y)$ is one-to-one and differentiable in $y$. Suppose further that
$r(g)$ is a probability density on $G$ and define another probability density
$\tilde f : X \times Y \times G \to [0,\infty)$ as $\tilde f(x,y,g) = f(x,t_g(y))\,|J_g(y)|\,r(g)$, where $J_g(z)$ is
the Jacobian of the transformation $z = t_g^{-1}(y)$. Let $\tilde f(x,y) = \int_G \tilde f(x,y,g)\,dg$
and note that $\int_Y \tilde f(x,y)\,dy = f_X(x)$. The PX-DA algorithm (which is the
same as the MA algorithm in this situation) is simply the alternative DA
algorithm based on the Mtd given by

(2) $p_r(x|x') = \int_Y \tilde f_{X|Y}(x|y)\,\tilde f_{Y|X}(y|x')\,dy$.

By varying $r(\cdot)$, we can create the family of joint densities mentioned above.
Liu and Wu [13], Meng and van Dyk [14] and van Dyk and Meng [23] 
(hereafter, L&W, M&vD and vD&M) have provided many examples where 
this strategy leads to major improvements over standard DA algorithms. 

Straightforward sampling from $\tilde f_{X|Y}$ and $\tilde f_{Y|X}$, which is necessary if the
PX-DA algorithm is to be useful in practice, is made possible by exploiting
the relationship between these conditionals and the joint density $\tilde f(x,y,g)$.
First, consider sampling from $\tilde f_{Y|X}$ and note that

$\tilde f_{Y|X}(y|x) = \int_G f_{Y|X}(t_g(y)|x)\,|J_g(y)|\,r(g)\,dg$.

Consequently, we can draw from $\tilde f_{Y|X}$ by drawing $y'$ and $g$ independently
from $f_{Y|X}(y'|x)$ and $r(g)$, respectively, and setting $y = t_g^{-1}(y')$. Now let
$f_Y(y) = \int_X f(x,y)\,dx$ and let $w(g;y)$ denote the density proportional to
$r(g)\,|J_g(y)|\,f_Y(t_g(y))$. We can draw from $\tilde f_{X|Y}$ by drawing $g$ from $w(g;y)$
and then $x \sim f_{X|Y}(x|t_g(y))$. Putting all of this together, as in [13], Scheme
1.1, a single iteration of the PX-DA algorithm ($x' \to x$) can be accomplished
by performing the following three steps:

1. Draw $y \sim f_{Y|X}(y|x')$.

2. Draw $g \sim r(\cdot)$, draw $g'$ from $w(g'; t_g^{-1}(y))$ and set $y' = t_{g'}(t_g^{-1}(y))$.

3. Draw $x \sim f_{X|Y}(x|y')$.

Note that the first and third steps are exactly the same as the two steps of the
DA algorithm. Given that $w(g;y)$ contains the term $f_Y(t_g(y))$ and that direct
sampling from $f_Y$ is infeasible (otherwise MCMC would be unnecessary),
one might expect that sampling from $w(g;y)$ would be difficult. However, as
the examples in [13, 14, 23] illustrate, when $g$ has lower dimension than $y$,
sampling from $w(g;y)$ can be completely straightforward, adding very little
to the overall computational burden.

We use Albert and Chib's [1] DA algorithm for Bayesian probit regression
as a running example. Let $V_1, V_2, \ldots, V_n$ denote independent random variables
with $V_i \mid \beta \sim \mathrm{Bernoulli}(\Phi(z_i^T\beta))$ where $z_i$ is a $p \times 1$ vector of known
covariates associated with $V_i$, $\beta$ is a $p \times 1$ vector of unknown regression
coefficients and $\Phi(\cdot)$ is the standard normal distribution function. A flat prior
on $\beta$ leads to an (intractable) posterior density given by

$\pi(\beta|v) = \frac{1}{m(v)} \prod_{i=1}^n [\Phi(z_i^T\beta)]^{v_i}\,[1 - \Phi(z_i^T\beta)]^{1-v_i}$,

where $m(v)$ is the marginal mass function. Let $\mathbb{R}_+ = (0,\infty)$, $\mathbb{R}_- = (-\infty,0]$
and consider the function

$\pi(\beta,y|v) = \frac{1}{m(v)} \prod_{i=1}^n \{ I_{\mathbb{R}_+}(y_i) I_{\{1\}}(v_i) + I_{\mathbb{R}_-}(y_i) I_{\{0\}}(v_i) \}\,\phi(y_i; z_i^T\beta, 1)$,

where $y = (y_1, y_2, \ldots, y_n)^T \in \mathbb{R}^n$, $I_A(\cdot)$ is the indicator of the set $A$ and
$\phi(x; \mu, \sigma^2)$ denotes the $N(\mu,\sigma^2)$ density function evaluated at the point $x$.
Straightforward calculations show that $\pi(\beta,y|v)$ is a joint density in $(\beta,y)$
whose $\beta$ marginal is the target, $\pi(\beta|v)$. Moreover, $\pi(\beta|y,v)$ is a multivariate
normal density and $\pi(y|\beta,v)$ is a product of $n$ truncated univariate
normal densities. Albert and Chib's algorithm alternates between these two
conditionals. L&W developed a PX-DA algorithm for this problem by taking
$t_g(y) = gy$ and $G = (0,\infty)$. This yields $w(g;y) \propto r(g)\,g^n \exp\{-g^2 y^T M y/2\}\,I_{\mathbb{R}_+}(g)$,
where $M$ is a known $n \times n$ matrix. Drawing from the multivariate density
$\pi(y|v)$ does not appear straightforward, but sampling from the univariate
density $w(g;y)$ is easy as long as $r(g)$ has a simple form. Indeed, L&W
take $r(g) \propto g^{a-1} e^{-bg^2} I_{\mathbb{R}_+}(g)$ where $a, b > 0$, which allows one to sample from
$w(g;y)$ by drawing a gamma variate and taking the square root.
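
The gamma-and-square-root recipe can be sketched as follows. The sketch assumes the prior $r(g) \propto g^{a-1} e^{-bg^2}$ on $(0,\infty)$ and writes q for the scalar $y^T M y$, which is taken as already computed; the function name and arguments are illustrative, not L&W's. Under these assumptions $w(g;y) \propto g^{a+n-1}\exp\{-g^2(b + q/2)\}$, and the substitution $u = g^2$ shows that $u$ is gamma distributed with shape $(a+n)/2$ and rate $b + q/2$.

```python
import math
import random


def draw_w(a, b, n, q, rng):
    """Draw g from w(g; y) proportional to g^(a+n-1) * exp(-g^2 * (b + q/2)).

    q stands for the scalar y^T M y (assumed precomputed and positive).
    With u = g^2, u ~ Gamma(shape=(a + n) / 2, rate=b + q / 2).
    """
    shape = (a + n) / 2.0
    rate = b + q / 2.0
    u = rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes a scale
    return math.sqrt(u)


if __name__ == "__main__":
    rng = random.Random(0)
    print([round(draw_w(a=1.0, b=1.0, n=10, q=4.0, rng=rng), 3) for _ in range(3)])
```

Because only a univariate gamma draw is needed, the cost of this middle step is negligible next to the $n$ truncated normal draws of the first step.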
1.2. A general class of alternatives to DA. Step 2 of the PX-DA algorithm
involves making the transition $y \to y'$ and can therefore be interpreted
as simulating one step of a Markov chain on $Y$. In fact, Theorem 1 in [13]
shows that $f_Y$ is an invariant density for this chain. Thus, the Mtd of the
PX-DA algorithm is a special case of the general Mtd given by

(3) $p_R(x|x') = \int_Y \int_Y f_{X|Y}(x|y')\,R(y,dy')\,f_{Y|X}(y|x')\,dy$,

where $R(y,dy')$ is any Markov transition function (Mtf) on $Y$ that has $f_Y$
as an invariant density. Routine calculations show that $f_X$ is invariant for
$p_R$ and that, if $R$ is reversible with respect to $f_Y$, then $p_R$ is reversible with
respect to $f_X$. In this paper, we perform the first general study of (3). The
main results provide conditions under which the Markov chain driven by (3)
is better than the corresponding DA algorithm. To be specific, we show that
if $p_R$ is reversible with respect to $f_X$, then $p_R$ is at least as good as $p$ in the
efficiency ordering of Mira and Geyer [16], which concerns performance in
the central limit theorem (CLT). (For a cleaner exposition, we henceforth
write "better than" instead of the more accurate "at least as good as.") We
also show that if $p_R$ is itself a DA algorithm, that is, if there exists a joint
density $f^*(x,y)$ such that $\int_Y f^*(x,y)\,dy = f_X(x)$ and such that $p_R$ can be
reexpressed as

$p_R(x|x') = \int_Y f^*_{X|Y}(x|y)\,f^*_{Y|X}(y|x')\,dy$,

then $p_R$ is better than $p$ in the operator norm sense (Liu, Wong and Kong
[11]).

Because the Mtds of the DA, MA and PX-DA algorithms can all be
written in the form (3), our general results concerning (3) can be brought
to bear on a theoretical comparison of these algorithms. This yields both
new results and generalizations of known results from [13, 14] and [23].
Furthermore, our proofs of the generalizations are simpler and require fewer
regularity conditions than the original proofs. It is our hope that the results
herein will promote theoretical and methodological development of improved
DA algorithms.

Here is a simple example of the application of our results concerning (3).
The PX-DA algorithm is, by definition, a DA algorithm and as such is reversible
with respect to $f_X$. Hence, the results described above are applicable
and imply that every PX-DA algorithm is better than the DA algorithm in
the efficiency ordering and in the operator norm sense. The efficiency ordering
result is new, but the operator norm result is known; see Theorem 2
in [13] and Theorem 1 in [14]. Note that we say "every PX-DA algorithm."
This is because the result holds no matter what (proper) density $r(g)$ is used
to construct the PX-DA algorithm.

1.3. Adapting to an improper $r(g)$: Liu and Wu's group structure. L&W,
M&vD and vD&M all argued that the PX-DA algorithm should perform better
as the density $r(g)$ becomes more "diffuse" or "spread out," and they
provided empirical evidence supporting this claim. It is clearly impossible to
implement the PX-DA algorithm in the limiting case where $r$ is improper.
However, L&W and M&vD found (what appear to be) different ways of
utilizing an improper $r(g)$ to construct an algorithm that achieves the limiting
convergence rate. L&W developed their results by exploiting a certain
group structure that may be present in the problem. M&vD, on the other
hand, constructed a nonpositive recurrent Markov chain on $X \times G$ having
stationary density $f_X(x)\,r(g)$ and provided conditions under which the $x$
component of that chain is itself a Markov chain with invariant density $f_X$.
We focus on L&W's approach and show that, when the group structure exists,
L&W's algorithm is exactly the same as M&vD's algorithm (under a
particular improper working prior). This is the first formal comparison of
the two limiting algorithms. We now briefly describe L&W's group structure
and limiting algorithm.

Suppose that $G$ is a topological group; that is, a group such that the
functions $(g_1,g_2) \mapsto g_1 g_2$ and $g \mapsto g^{-1}$ are both continuous. Let $e$ denote
the group's identity element. (An example is the multiplicative group, $\mathbb{R}_+$,
where group composition is defined as multiplication, the identity element
is $e = 1$ and $g^{-1} = 1/g$.) Suppose further that $t_e(y) = y$ for all $y \in Y$ and
that $t_{g_1 g_2}(y) = t_{g_1}(t_{g_2}(y))$ for all $g_1, g_2 \in G$ and all $y \in Y$. Assume that $G$ is
a unimodular group and let $\nu(dg)$ denote Haar measure on $G$. One iteration
of L&W's limiting algorithm, which we call the Haar PX-DA algorithm,
consists of the following three steps:

1. Draw $y \sim f_{Y|X}(y|x')$.

2. Draw $g$ from the density (with respect to $\nu$) proportional to $|J_g(y)|\,f_Y(t_g(y))$
and set $y' = t_g(y)$.

3. Draw $x \sim f_{X|Y}(x|y')$.
Note that the Haar PX-DA algorithm actually requires less computation 
than the PX-DA algorithm. Indeed, Step 2 involves only a single draw from 
a distribution on G, while the middle step of the PX-DA algorithm requires 
two such draws. The Mtd associated with this algorithm has $f_X$ as an invariant
density (see L&W) and is, in fact, another special case of (3). (Note
that the invariance of $f_X$ is not obvious in this case because, unlike PX-DA,
the Haar PX-DA algorithm is not defined as an alternative DA algorithm.)
L&W proved that the Haar PX-DA algorithm is better in the operator norm 
sense than every PX-DA algorithm. 

Consider again the probit regression example. The multiplicative group,
$G = \mathbb{R}_+$, is unimodular with Haar measure given by $\nu(dg) = dg/g$ where
$dg$ denotes Lebesgue measure. Furthermore, the transformation $t_g(y) = gy$
satisfies the compatibility conditions described above, so the Haar PX-DA
algorithm is applicable. As shown in [13], the middle step entails drawing $g$
from a density proportional to $g^{n-1} \exp\{-g^2 y^T M y/2\}\,I_{\mathbb{R}_+}(g)$. Both L&W and
vD&M provide strong empirical evidence that this algorithm can converge
much faster than Albert and Chib's [1] DA algorithm.
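
The three steps of the Haar PX-DA algorithm for the probit model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a single covariate ($p = 1$), a flat prior on $\beta$, and $M = I - Z(Z^T Z)^{-1}Z^T$, which is what integrating $\beta$ out of $\pi(\beta,y|v)$ gives under a flat prior. The function names and the `use_haar_step` switch (which turns the sampler back into Albert and Chib's DA algorithm) are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_truncated_normal(mean, positive, rng):
    """Draw from N(mean, 1) restricted to (0, inf) if positive, else (-inf, 0].

    Uses the inverse-CDF method; the clamp below is crude and only
    adequate for moderate values of |mean|.
    """
    p0 = STD_NORMAL.cdf(-mean)               # P(N(mean, 1) <= 0)
    lo, hi = (p0, 1.0) if positive else (0.0, p0)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)      # keep inv_cdf away from 0 and 1
    return mean + STD_NORMAL.inv_cdf(u)


def haar_px_da_probit(z, v, n_iter, beta0=0.0, seed=0, use_haar_step=True):
    """Haar PX-DA (or plain DA) for probit regression with one covariate.

    Assumptions of this sketch: p = 1, flat prior on beta, and
    M = I - Z (Z^T Z)^{-1} Z^T, so y^T M y = sum(y^2) - (sum(z*y))^2 / sum(z^2).
    """
    rng = random.Random(seed)
    n = len(z)
    szz = sum(zi * zi for zi in z)
    beta, draws = beta0, []
    for _ in range(n_iter):
        # Step 1: y_i ~ truncated N(z_i * beta, 1), with the sign fixed by v_i.
        y = [draw_truncated_normal(zi * beta, vi == 1, rng) for zi, vi in zip(z, v)]
        if use_haar_step:
            # Step 2: g^2 ~ Gamma(n/2, rate = y^T M y / 2), then y' = g * y.
            szy = sum(zi * yi for zi, yi in zip(z, y))
            q = sum(yi * yi for yi in y) - szy * szy / szz
            g = math.sqrt(rng.gammavariate(n / 2.0, 2.0 / q))
            y = [g * yi for yi in y]
        # Step 3: beta ~ N((Z^T Z)^{-1} Z^T y', (Z^T Z)^{-1}).
        szy = sum(zi * yi for zi, yi in zip(z, y))
        beta = rng.gauss(szy / szz, 1.0 / math.sqrt(szz))
        draws.append(beta)
    return draws
```

Calling `haar_px_da_probit(z, v, 5000)` and `haar_px_da_probit(z, v, 5000, use_haar_step=False)` on the same data gives two chains with the same target whose autocorrelations can be compared directly.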

1.4. Comparing general versions of PX-DA and Haar PX-DA. We develop
generalizations of the PX-DA and Haar PX-DA algorithms in a setting
where $X$, $Y$ and $G$ are abstract spaces (not necessarily Euclidean) and the
group $G$ is not required to be unimodular (Haar measure is replaced by
left-Haar measure). This is accomplished in two steps. First, the group structure
is used to build Mtfs, $Q_r(y,dy')$ and $Q(y,dy')$, that are reversible with
respect to $f_Y$ and that behave like general versions of the middle steps of the
PX-DA and Haar PX-DA algorithms. Then Mtds for the generalized versions
of PX-DA and Haar PX-DA are formed by using $Q_r$ and $Q$ in place of
$R$ in (a generalized version of) (3). Because L&W did not use the term "Haar
PX-DA," it is important to bear in mind throughout this paper that what
we call the "general Haar PX-DA algorithm" is, in fact, a generalization of
L&W's limiting PX-DA algorithm.

A comparison of the resulting generalized algorithms is facilitated by a
representation of Haar PX-DA as an improvement of PX-DA. More specifically,
we show that there exists a joint density $\tilde f(x,y)$, whose $x$ marginal is
$f_X$, such that the Mtd of the general PX-DA algorithm can be written as

(4) $p_r(x|x') = \int_Y \tilde f_{X|Y}(x|y)\,\tilde f_{Y|X}(y|x')\,\mu_y(dy)$,

where $\mu_x(dx)$ and $\mu_y(dy)$ are the analogues of $dx$ and $dy$ that will be defined
in Section 3. (This, of course, implies that PX-DA is better than DA.) We
then show that the Mtd of the general Haar PX-DA algorithm can be written as

$p^*(x|x') = \int_Y \int_Y \tilde f_{X|Y}(x|y')\,Q(y,dy')\,\tilde f_{Y|X}(y|x')\,\mu_y(dy)$,

where $\tilde f$ is as in (4) and $Q(y,dy')$ is reversible with respect to
$\int_X \tilde f(x,y)\,\mu_x(dx) = \tilde f_Y(y)$; that is, $p^*$ is an improvement of $p_r$. It is also shown that $p^*(x|x')$
is itself a DA algorithm. Therefore, our results concerning (3) imply that
$p^*(x|x')$ is better than every version of $p_r(x|x')$ in the efficiency ordering
and in the operator norm sense. As before, the efficiency ordering result is
new, but a special case of the operator norm result was established in [13]
(see Section 5 for details).

The remainder of the paper is laid out as follows. In Section 2, we set 
notation and review some results from general state space Markov chain 
theory. Our study of (3) commences in Section 3. In Section 4, we describe 
two different methods of using a group action to construct a Mtf with a 
prespecified stationary distribution. Finally, our general versions of the PX- 
DA and Haar PX-DA algorithms are introduced and studied in Section 5. 

2. Markov chain background. As in Meyn and Tweedie ([15], Chapter 3)
let $P(x,dy)$ be a Mtf on a set $X$ equipped with a countably generated
$\sigma$-algebra $\mathcal{B}(X)$. Suppose that $\pi$ is an invariant probability measure;
that is, $\pi(A) = \int_X P(x,A)\,\pi(dx)$ for all measurable $A$. Denote the
Markov chain defined by $P(x,dy)$ as $\Phi = \{\Phi_n\}_{n=0}^{\infty}$, where the distribution
of $\Phi_0$ will be stated explicitly when needed. As usual, let $L^2(\pi)$ be
the vector space of real-valued, measurable functions on $X$ that are square-integrable
with respect to $\pi$, and let $L^2_0(\pi)$ be the subspace of mean zero
functions; that is, functions satisfying $\int_X f(x)\,\pi(dx) = 0$. Define an inner product
on this space by $\langle f,g \rangle = \int_X f(x)g(x)\,\pi(dx)$. The corresponding norm is
given by $\|f\| = \sqrt{\langle f,f \rangle}$. The Mtf $P(x,dy)$ defines an operator, $P$, that acts
on $f \in L^2_0(\pi)$ through

$(Pf)(x) = \int_X f(y)\,P(x,dy)$.

Note that $\langle Pf,f \rangle = \mathrm{Cov}(f(\Phi_0), f(\Phi_1))$ when $\Phi_0 \sim \pi$. The chain $\Phi$ (or, equivalently,
the Mtf $P$) is said to be reversible with respect to $\pi$ if for all bounded
functions $f, g \in L^2_0(\pi)$, $\langle Pf,g \rangle = \langle f,Pg \rangle$. The norm of the operator $P$ is
defined as

$\|P\| = \sup_{f \in L^2_0(\pi),\, \|f\| = 1} \|Pf\|$.

A straightforward application of Jensen's inequality shows that $\|P\| \le 1$.

Now assume that $\int_X |h(x)|\,\pi(dx) < \infty$ and that MCMC will be used to
estimate the intractable expectation $\pi h := \int_X h(x)\,\pi(dx)$. If $\Phi$ is irreducible,
aperiodic and Harris recurrent (see Meyn and Tweedie [15] for definitions),
then the ergodic average $\bar h_n = n^{-1} \sum_{i=0}^{n-1} h(\Phi_i)$ converges almost surely to
$\pi h$ no matter what the distribution of $\Phi_0$. This justifies the use of $\bar h_n$ as an
estimator of $\pi h$. There are several different methods available for calculating
the standard error of this estimator (see, e.g., Geyer [5], Hobert, Jones,
Presnell and Rosenthal [7] and Jones, Haran, Caffo and Neath [8]) and all
are based on the assumption that there is a CLT for $\bar h_n$; that is, that there
exists a $\sigma^2 \in (0,\infty)$ such that, as $n \to \infty$, $\sqrt{n}(\bar h_n - \pi h) \to_d N(0,\sigma^2)$. The
asymptotic variance, $\sigma^2$, depends on both the function $h$ and the Mtf $P$
(but not on the distribution of $\Phi_0$) so we write it as $v(h,P)$. If the CLT fails
to hold, then we simply write $v(h,P) = \infty$.
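
As a rough illustration of how $v(h,P)$ is estimated in practice, the sketch below uses non-overlapping batch means, one standard technique (it is offered here as an example and is not claimed to be the method of any of the works cited above). The test chain is a toy stationary AR(1) process with coefficient phi, chosen because for $h(x) = x$ its asymptotic variance is known exactly: $(1+\phi)/(1-\phi)$. All names are illustrative.

```python
import random


def batch_means_variance(values, batch_size):
    """Batch-means estimate of the asymptotic variance v(h, P).

    `values` holds h evaluated along the chain. The estimate is
    batch_size times the sample variance of the non-overlapping batch means.
    """
    n_batches = len(values) // batch_size
    means = [
        sum(values[k * batch_size:(k + 1) * batch_size]) / batch_size
        for k in range(n_batches)
    ]
    grand = sum(means) / n_batches
    return batch_size * sum((m - grand) ** 2 for m in means) / (n_batches - 1)


def ar1_chain(phi, n_iter, seed=0):
    """Stationary AR(1) chain with N(0, 1) marginal (toy example)."""
    rng = random.Random(seed)
    sd = (1.0 - phi * phi) ** 0.5
    x, out = rng.gauss(0.0, 1.0), []
    for _ in range(n_iter):
        x = phi * x + rng.gauss(0.0, sd)
        out.append(x)
    return out


if __name__ == "__main__":
    chain = ar1_chain(phi=0.5, n_iter=200000)
    # For h(x) = x the exact asymptotic variance is (1 + phi) / (1 - phi) = 3.
    print(batch_means_variance(chain, batch_size=500))
```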

Unfortunately, even if $h \in L^2(\pi)$, irreducibility, aperiodicity and Harris
recurrence (henceforth "the usual regularity conditions") are not enough to
guarantee that $v(h,P) < \infty$. The chain is called geometrically ergodic if there
exist $M : X \to [0,\infty)$ and $\rho \in [0,1)$ such that $\|P^n(x,\cdot) - \pi(\cdot)\|_{TV} \le M(x)\rho^n$
for all $x \in X$ and all $n = 1, 2, 3, \ldots$, where $\|\cdot\|_{TV}$ denotes total variation
norm. If $\Phi$ is geometrically ergodic and reversible with respect to $\pi$, then
$v(h,P) < \infty$ for every $h \in L^2(\pi)$ (Roberts and Rosenthal [17]). Many popular
Monte Carlo Markov chains have been shown to be geometrically ergodic.
See, for example, Jones and Hobert [9] and Roberts and Rosenthal [18], and
the references therein.

Now suppose that we wish to estimate $\pi h$ and we have available two
different Mtf's, $P$ and $Q$, with invariant probability measure $\pi$ such that
$v(h,P)$ and $v(h,Q)$ are both finite. If $P$ and $Q$ are similar in terms of
simulation effort, then we would clearly prefer the more efficient chain; that
is, the chain with the smaller asymptotic variance. Moreover, if $v(h,P) \le v(h,Q)$
for all $h$, then we would prefer $P$ over $Q$ regardless of the function
$h$. This discussion motivates the following definitions from Mira and Geyer
[16].

Definition 1. If $P$ and $Q$ are two Mtf's with invariant probability
measure $\pi$ that both satisfy the usual regularity conditions, then $P$ is better
than $Q$ in the efficiency ordering, written $P \succeq_E Q$, if $v(h,P) \le v(h,Q)$ for
every $h \in L^2(\pi)$.

Definition 2. If $P$ and $Q$ are two Mtf's with invariant probability
measure $\pi$, then $P$ dominates $Q$ in the covariance ordering, written $P \succeq_1 Q$,
if $\langle Ph,h \rangle \le \langle Qh,h \rangle$ for every $h \in L^2_0(\pi)$.

The following result provides a characterization of the efficiency ordering
for reversible chains as well as a practical method of proving that $P \succeq_E Q$.

Theorem 1 (Mira and Geyer [16]). Let $P$ and $Q$ be two Mtf's that are
reversible with respect to the probability measure $\pi$ and that satisfy the usual
regularity conditions. Then $P \succeq_E Q$ if and only if $P \succeq_1 Q$.

It is important to note that $\succeq_E$ provides only a partial ordering; that is, it
can happen that neither $P \succeq_E Q$ nor $Q \succeq_E P$ holds. In such a case, neither
chain is better than the other and the choice between $P$ and $Q$ will depend
on the particular function to be estimated.
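
On a finite state space the covariance ordering of Definition 2 can be checked directly, since $\langle Ph,h \rangle$ is a finite sum. The following sketch uses a toy example that is not from the paper: $\pi$ is uniform on three states, $P$ is a symmetric (hence reversible) transition matrix, and $Q = (I + P)/2$ is its "lazy" version. Because $\langle Qh,h \rangle = (\|h\|^2 + \langle Ph,h \rangle)/2 \ge \langle Ph,h \rangle$, $P$ dominates $Q$ in the covariance ordering; the code verifies this on random mean-zero functions.

```python
import random


def inner(P, h, pi):
    """<Ph, h> = sum_x pi(x) h(x) sum_y P[x][y] h(y) on a finite state space."""
    n = len(pi)
    return sum(pi[x] * h[x] * sum(P[x][y] * h[y] for y in range(n)) for x in range(n))


def center(h, pi):
    """Project h onto the mean-zero subspace L^2_0(pi)."""
    mean = sum(p * v for p, v in zip(pi, h))
    return [v - mean for v in h]


# Toy example (not from the paper): uniform pi on three states.
PI = [1 / 3, 1 / 3, 1 / 3]
P = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]   # symmetric, so reversible
Q = [[0.5 * (x == y) + 0.5 * P[x][y] for y in range(3)] for x in range(3)]  # lazy P


def p_dominates_q(n_trials=1000, seed=0):
    """Check <Ph, h> <= <Qh, h> on random mean-zero test functions."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        h = center([rng.uniform(-1, 1) for _ in PI], PI)
        if inner(P, h, PI) > inner(Q, h, PI) + 1e-12:
            return False
    return True
```

A random search of this kind can only refute domination, not prove it; here the inequality holds for every $h$ by the algebraic identity given in the lead-in.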
Monte Carlo Markov chains can also be compared via their operator
norms. Indeed, the quantity $\|P\|$ is closely related to the convergence rate of
the corresponding Markov chain. For instance, if $P$ is reversible with respect
to $\pi$ and satisfies the usual regularity conditions, then $P$ is geometrically
ergodic if and only if $\|P\| < 1$ (Roberts and Rosenthal [17] and Roberts and
Tweedie [20]). Furthermore, results in Liu, Wong and Kong [12] show that
the smaller the norm, the faster the chain converges. Examples of the use
of this criterion for comparing Monte Carlo Markov chains can be found in
[11, 13, 14].

It is important to keep in mind that neither $\|P\| \le \|Q\|$ nor $P \succeq_E Q$
guarantees that $P$ is a good Monte Carlo Markov chain. Indeed, even if
$\|P\| \le \|Q\|$, it may be the case that both $P$ and $Q$ are bad chains (with norm
1) and neither should be used. Similarly, $P \succeq_E Q$ tells us nothing about the
existence of CLTs for $P$. However, if $P$ is also known to be geometrically
ergodic, then we could rule out $Q$ and be content to use $P$ to explore the
target distribution. The results described above imply that if $P$ and $Q$ are
both reversible and $\|P\| \le \|Q\|$, then geometric ergodicity of $Q$ implies that
of $P$. (See Roberts and Rosenthal [19] for some related results.) This result
can be extremely useful in practice because the better chain ($P$ in this
case) is typically more complex and hence harder to analyze. This idea is
exploited in Roy and Hobert [21], who prove that the Haar PX-DA algorithm
for the probit model (discussed in Section 1) is geometric by showing that
the simpler DA algorithm of Albert and Chib is geometric.

3. Improving upon the DA algorithm. In this section, we study Mtds of
the form (3). Assume that $X$ and $Y$ are locally compact, separable metric
spaces equipped with their Borel $\sigma$-algebras. Assume further that $\mu_x$ and
$\mu_y$ are $\sigma$-finite measures on $X$ and $Y$, respectively, and that $f(x,y)$ is a
probability density on $X \times Y$ with respect to $\mu_x \times \mu_y$. As usual, let $f_X$,
$f_Y$, $f_{X|Y}$ and $f_{Y|X}$ denote the marginal and conditional densities. In this
context, the DA algorithm has Mtd (with respect to $\mu_x$) given by

(5) $p(x|x') = \int_Y f_{X|Y}(x|y)\,f_{Y|X}(y|x')\,\mu_y(dy)$.

The analogue of (3) is

(6) $p_R(x|x') = \int_Y \int_Y f_{X|Y}(x|y')\,R(y,dy')\,f_{Y|X}(y|x')\,\mu_y(dy)$,

where $R(y,dy')$ is any Mtf on $Y$ that has $f_Y$ as an invariant density. Again,
straightforward calculations reveal that $f_X$ is an invariant density for $p_R$
and that reversibility of $R$ with respect to $f_Y$ implies reversibility of $p_R$
with respect to $f_X$. Varying the Mtf $R(y,dy')$ produces a family of Markov
chains having $f_X$ as invariant density, and (as we explain later) the DA
algorithm is one of the family members. In some cases, $p_R$ is itself a DA
algorithm; that is, there exists a probability density $f^*(x,y)$ on $X \times Y$ with
respect to $\mu_x \times \mu_y$ such that $\int_Y f^*(x,y)\,\mu_y(dy) = f_X(x)$ and such that $p_R$ can
be reexpressed as

(7) $p_R(x|x') = \int_Y f^*_{X|Y}(x|y)\,f^*_{Y|X}(y|x')\,\mu_y(dy)$.

Clearly, if $p_R$ is a DA algorithm, then it is reversible with respect to $f_X$.
We now state a known result about DA that will be used to prove the main
result in this section.

Theorem 2 (Amit [2] and Liu, Wong and Kong [11]). Let $P$ denote
the operator on $L^2_0(f_X)$ associated with $p(x|x')$. Let $(X,Y) \sim f(x,y)$ and
$h \in L^2_0(f_X)$. Then $\langle Ph,h \rangle = \mathrm{Var}[E(h(X)|Y)]$ and $\|P\| = \gamma^2(X,Y)$, where
$\gamma(X,Y)$ is the maximal correlation between $X$ and $Y$.

The next result allows us to compare two different versions of (6). 

Theorem 3. Suppose that $R$ and $S$ are two Mtf's on $Y$ that have $f_Y$
as invariant density and assume that $R \succeq_1 S$. Let $p_R$ and $p_S$ denote the
corresponding versions of (6) and denote the associated operators as $P_R$ and
$P_S$. Assume that $p_R$ and $p_S$ satisfy the usual regularity conditions. If $p_R$ and
$p_S$ are both reversible with respect to $f_X$, then $p_R \succeq_E p_S$. If, in addition, $p_R$
and $p_S$ are both DA algorithms, then $\|P_R\| \le \|P_S\|$.

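PROOF. Let $\Phi^* = \{\Phi^*_n\}_{n=0}^{\infty}$ and $\Phi = \{\Phi_n\}_{n=0}^{\infty}$ denote stationary versions
of the chains driven by $p_R$ and $p_S$, respectively. Fix $h \in L^2_0(f_X)$ and define
$h^*(y) = \int_X h(x)\,f_{X|Y}(x|y)\,\mu_x(dx)$. It is easy to see that $h^* \in L^2_0(f_Y)$. Now

$\langle P_R h, h \rangle = \int_X \int_X h(x')\,h(x)\,p_R(x|x')\,f_X(x')\,\mu_x(dx')\,\mu_x(dx)$

$= \int_X \int_X \int_Y \int_Y h(x')\,h(x)\,f_{X|Y}(x|y')\,R(y,dy')\,f_{Y|X}(y|x')\,f_X(x')\,\mu_y(dy)\,\mu_x(dx')\,\mu_x(dx)$

$= \int_Y \int_Y [\int_X h(x)\,f_{X|Y}(x|y')\,\mu_x(dx)]\,[\int_X h(x')\,f_{X|Y}(x'|y)\,\mu_x(dx')]\,R(y,dy')\,f_Y(y)\,\mu_y(dy)$

$= \int_Y \int_Y h^*(y)\,h^*(y')\,R(y,dy')\,f_Y(y)\,\mu_y(dy)$

$\le \int_Y \int_Y h^*(y)\,h^*(y')\,S(y,dy')\,f_Y(y)\,\mu_y(dy) = \langle P_S h, h \rangle$,

where the inequality follows from the fact that $R \succeq_1 S$. It then follows from
Theorem 1 that $p_R \succeq_E p_S$. Now let $f^*(x,y)$ and $\tilde f(x,y)$ denote the densities
that allow us to express $p_R$ and $p_S$ as DA algorithms. In conjunction with
the results above, Theorem 2 implies that

$\mathrm{Var}[E(h(X^*)|Y^*)] = \langle P_R h, h \rangle \le \langle P_S h, h \rangle = \mathrm{Var}[E(h(\tilde X)|\tilde Y)]$,

where $(X^*,Y^*) \sim f^*(x,y)$ and $(\tilde X,\tilde Y) \sim \tilde f(x,y)$. Now, since $X^*$ and $\tilde X$ have the
same distribution, we have $\{g : \mathrm{Var}\,g(X^*) = 1\} = \{g : \mathrm{Var}\,g(\tilde X) = 1\}$. Suppose that $\mathrm{Var}\,g(X^*) = 1$
and put $\mu_g = \int_X g(x)\,f_X(x)\,\mu_x(dx)$. Then

$\mathrm{Var}[E(g(X^*)|Y^*)] = \mathrm{Var}\{E[(g(X^*) - \mu_g)|Y^*]\}$

$\le \mathrm{Var}\{E[(g(\tilde X) - \mu_g)|\tilde Y]\} = \mathrm{Var}[E(g(\tilde X)|\tilde Y)]$.

But it is well known that for random elements $U$ and $V$,

$\gamma^2(U,V) = \sup_{\{h : \mathrm{Var}\,h(U) = 1\}} \mathrm{Var}[E(h(U)|V)]$.

It follows that $\|P_R\| = \gamma^2(X^*,Y^*) \le \gamma^2(\tilde X,\tilde Y) = \|P_S\|$. □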

Theorem 3 actually allows us to compare the DA algorithm with the
algorithm based on (6). Indeed, the Mtd (5) can be viewed as a special case
of (6) where $R(y,dy')$ is taken to be the trivial Mtf that is a point mass at
$y$. This trivial Mtf is obviously dominated in the covariance ordering by any
nontrivial $R$. We conclude that, if $p_R$ can be expressed as a DA algorithm,
then it is better than the original DA algorithm both in terms of efficiency
and operator norm. We state this as a corollary.

Corollary 1. Suppose that $R$ is a Mtf on $Y$ that has $f_Y$ as invariant
density. Let $p_R$ be as in (6) and denote the associated operator by $P_R$. Assume
that $p$ and $p_R$ satisfy the usual regularity conditions. If $p_R$ is reversible
with respect to $f_X$, then $p_R \succeq_E p$. If, in addition, $p_R$ is a DA algorithm, then
$\|P_R\| \le \|P\|$.
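
The covariance inequality behind Corollary 1 can be checked numerically when $X$ and $Y$ are finite, because the kernels (5) and (6) reduce to small matrices. The sketch below is only an illustration under assumptions of our own choosing: the joint pmf `F` is an arbitrary toy table, and `R` is a Metropolis kernel on $Y$ with uniform proposals, which is reversible with respect to $f_Y$. The helper names are hypothetical.

```python
def da_kernels(f, R):
    """Build the DA kernel p of (5) and the kernel p_R of (6) on finite X, Y.

    f[x][y] is a joint pmf; R[y][y2] is a transition matrix on Y assumed to
    leave f_Y invariant. Kernels are returned as K[x_old][x_new].
    """
    nx, ny = len(f), len(f[0])
    fX = [sum(f[x]) for x in range(nx)]
    fY = [sum(f[x][y] for x in range(nx)) for y in range(ny)]
    x_given_y = [[f[x][y] / fY[y] for x in range(nx)] for y in range(ny)]  # [y][x]
    y_given_x = [[f[x][y] / fX[x] for y in range(ny)] for x in range(nx)]  # [x][y]
    p = [[sum(y_given_x[xo][y] * x_given_y[y][xn] for y in range(ny))
          for xn in range(nx)] for xo in range(nx)]
    p_R = [[sum(y_given_x[xo][y] * R[y][y2] * x_given_y[y2][xn]
                for y in range(ny) for y2 in range(ny))
            for xn in range(nx)] for xo in range(nx)]
    return fX, fY, p, p_R


def inner(K, h, pi):
    """<Kh, h> with respect to pi."""
    n = len(pi)
    return sum(pi[a] * h[a] * sum(K[a][b] * h[b] for b in range(n)) for a in range(n))


# Toy joint pmf on X = {0, 1, 2}, Y = {0, 1, 2} (illustrative numbers only).
F = [[0.20, 0.05, 0.05], [0.05, 0.20, 0.05], [0.05, 0.05, 0.30]]
FY = [0.30, 0.30, 0.40]   # column sums of F

# Metropolis kernel on Y with uniform proposals: reversible w.r.t. FY.
R = [[0.0] * 3 for _ in range(3)]
for a in range(3):
    for b in range(3):
        if a != b:
            R[a][b] = 0.5 * min(1.0, FY[b] / FY[a])
    R[a][a] = 1.0 - sum(R[a])   # diagonal entry is still 0 when summed

if __name__ == "__main__":
    fX, fY, p, p_R = da_kernels(F, R)
    h = [1.0 - fX[0], -fX[0], -fX[0]]   # centered indicator of state 0
    print(inner(p_R, h, fX), "<=", inner(p, h, fX))
```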

In order to apply Corollary 1, we must establish that $p_R$ is reversible and
possibly that $p_R$ is a DA algorithm. We know that reversibility of $R$ implies
that of $p_R$. The next result shows that there is also a simple condition on $R$
that implies that $p_R$ is a DA algorithm.

Proposition 1. Let $R$ be a Mtf on $Y$ that has $f_Y$ as invariant density
and let $p_R$ be as in (6). If there exists a Mtf $R^{1/2}(y,dy')$ that is reversible with
respect to $f_Y$ and is such that $R(y,dy') = \int_Y R^{1/2}(w,dy')\,R^{1/2}(y,dw)$, then $p_R$
is a DA algorithm with respect to $f^*(x,y) = f_Y(y) \int_Y f_{X|Y}(x|y')\,R^{1/2}(y,dy')$.

Proof. First, it is easy to see (without using reversibility) that
$\int_Y f^*(x,y)\,\mu_y(dy) = f_X(x)$. Now

$\int_Y f^*_{X|Y}(x|y)\,f^*_{Y|X}(y|x')\,\mu_y(dy)$

$= \int_Y \frac{f^*(x,y)}{\int_X f^*(x,y)\,\mu_x(dx)} \cdot \frac{f^*(x',y)}{\int_Y f^*(x',y)\,\mu_y(dy)}\,\mu_y(dy)$

$= \frac{1}{f_X(x')} \int_Y [\int_Y f_{X|Y}(x|y')\,R^{1/2}(y,dy')]\,[\int_Y f_{X|Y}(x'|y'')\,R^{1/2}(y,dy'')]\,f_Y(y)\,\mu_y(dy)$

$= \frac{1}{f_X(x')} \int_Y \int_Y [\int_Y f_{X|Y}(x|y')\,R^{1/2}(y'',dy')]\,f_{X|Y}(x'|y)\,f_Y(y)\,R^{1/2}(y,dy'')\,\mu_y(dy)$

$= \int_Y \int_Y f_{X|Y}(x|y')\,f_{Y|X}(y|x')\,[\int_Y R^{1/2}(y,dy'')\,R^{1/2}(y'',dy')]\,\mu_y(dy)$

$= p_R(x|x')$,

where the third equality follows from the reversibility of $R^{1/2}$ with respect to $f_Y$,
which allows the roles of $y$ and $y''$ to be interchanged. □



Two situations where the hypotheses of Proposition 1 are clearly satisfied
are (i) if $R$ is reversible with respect to $f_Y$ and idempotent in the sense
that $R(y,dy') = \int_Y R(w,dy')\,R(y,dw)$, and (ii) if $R$ is defined to be the Mtf
corresponding to two consecutive steps of a chain on $Y$ that is reversible
with respect to $f_Y$.



4. Using group actions to construct Markov transition functions. We
now use the group structure on $G$ to build two Mtf's, $Q_r(y,dy')$ and $Q(y,dy')$,
that behave like general versions of the middle steps (Step 2) of the PX-DA
and Haar PX-DA algorithms described in Section 1.



4.1. The group structure. Let $Y$ and $f_Y$ be as defined in the previous
section and assume now that $G$ is another locally compact, separable metric
space that is also a topological group. Suppose that the group $G$ acts topologically
on the left of $Y$; that is, there is a continuous function $F : G \times Y \to Y$
such that $F(e,y) = y$ for all $y \in Y$ and $F(g_1 g_2, y) = F(g_1, F(g_2, y))$ for all
$g_1, g_2 \in G$ and all $y \in Y$. [Note that $F(g,y)$ is playing the role of $t_g(y)$ from
Section 1.] As is typically done, we will abbreviate $F(g,y)$ with $gy$ so, for
example, the second condition is written $(g_1 g_2)y = g_1(g_2 y)$.

As in Eaton [3], we use the term multiplier to describe a continuous
homomorphism of $G$ into the multiplicative group $\mathbb{R}_+$; that is, a function
$\chi : G \to \mathbb{R}_+$ is a multiplier if $\chi$ is continuous and $\chi(g_1 g_2) = \chi(g_1)\chi(g_2)$ for all
$g_1, g_2 \in G$. Clearly, if $\chi$ is a multiplier, then $\chi(e) = 1$ and $\chi(g^{-1}) = 1/\chi(g)$.
The measure $\mu_y$ is called relatively (left) invariant with multiplier $\chi$ if

$\chi(g) \int_Y h(gy)\,\mu_y(dy) = \int_Y h(y)\,\mu_y(dy)$,

for all $g \in G$ and all integrable functions $h : Y \to \mathbb{R}$. As an example, consider
the PX-DA algorithm for the probit model that was discussed in Section 1. In
that case, the group acts on the left of $Y = \mathbb{R}^n$ through scalar multiplication,
$(g,y) \mapsto gy$, and $\mu_y$, which is Lebesgue measure on $\mathbb{R}^n$, is easily seen to be
relatively invariant with multiplier $\chi(g) = g^n$.

While all of the examples considered in [13, 14] and [23] satisfy the assumptions
of the previous two paragraphs, this level of generality is not
quite enough. In order to ensure that our results subsume those of L&W,
we assume that there exists a function $j : G \times Y \to \mathbb{R}_+$ such that:

1- j(9-\y) = j^ V^GCyGY, 

2. $j(g_1 g_2, y) = j(g_1, g_2 y)\,j(g_2, y)$ $\forall g_1, g_2 \in G$, $y \in Y$, and
3. For all $g \in G$ and all integrable functions $h : Y \to \mathbb{R}$,

(8) $\int_Y h(gy)\,j(g,y)\,\mu_y(dy) = \int_Y h(y)\,\mu_y(dy)$.

Note that when $\mu_y$ is relatively invariant, we can simply take $j(g,y)$ to
be $\chi(g)$. Now suppose (as in [13]) that $Y \subseteq \mathbb{R}^n$, $\mu_y$ is Lebesgue measure
on $Y$, and for each fixed $g \in G$, $F(g,\cdot) : Y \to Y$ is differentiable. Then if we
take $j(g,y)$ to be the Jacobian of the transformation $y \mapsto F(g,y)$, the three
properties listed above follow straightforwardly from calculus.
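
Relative invariance is easy to check numerically in the simplest case. The sketch below takes $Y = \mathbb{R}$ ($n = 1$), Lebesgue measure, the scaling action $(g,y) \mapsto gy$ and multiplier $\chi(g) = g$, and compares $\chi(g)\int h(gy)\,dy$ with $\int h(y)\,dy$ by a midpoint rule. The test function `h`, the integration limits and the function names are arbitrary choices made for illustration.

```python
import math


def integrate(func, lo, hi, n_steps=100000):
    """Midpoint rule on [lo, hi]."""
    width = (hi - lo) / n_steps
    return width * sum(func(lo + (k + 0.5) * width) for k in range(n_steps))


def h(y):
    """An integrable test function on the real line (arbitrary choice)."""
    return math.exp(-y * y) * (1.0 + math.sin(y))


def check_relative_invariance(g, lo=-40.0, hi=40.0):
    """Return (chi(g) * integral of h(g*y), integral of h(y)) with chi(g) = g."""
    lhs = g * integrate(lambda y: h(g * y), lo, hi)
    rhs = integrate(h, lo, hi)
    return lhs, rhs


if __name__ == "__main__":
    for g in (0.5, 3.0):
        print(g, check_relative_invariance(g))
```

The two numbers agree because the substitution $u = gy$ contributes exactly the factor $1/g$ that $\chi(g)$ cancels; in $\mathbb{R}^n$ the same substitution contributes $g^{-n}$.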

4.2. A transition based on a probability measure on G. We now build a
Mtf, $Q_r$, that is a generalized version of Step 2 of the PX-DA algorithm. Let
$r$ be a probability measure on $G$. Define

$m_r(y) = \int_G f_Y(gy)\,j(g,y)\,r(dg)$

and assume that $m_r(y) > 0$ for all $y \in Y$. Define $N = \{y \in Y : m_r(y) = \infty\}$
and let $\bar Y = Y \setminus N$. Note that $\int_Y m_r(y)\,\mu_y(dy) = 1$, which implies that $\mu_y(N) = 0$.
Assume that $gy \in \bar Y$ for all $y \in \bar Y$ and all $g \in G$. A simple calculation shows
that, for fixed $y \in \bar Y$,

$f_Y(g'g^{-1}y)\,j(g', g^{-1}y) / m_r(g^{-1}y)$

is a probability density function on $G \times G$ with respect to $r \times r$. Let $Q_r$ be
an operator on $L^2_0(f_Y)$ defined as

$(Q_r h)(y) = \int_G \int_G h(g'g^{-1}y)\,\frac{f_Y(g'g^{-1}y)\,j(g', g^{-1}y)}{m_r(g^{-1}y)}\,r(dg')\,r(dg)$

when $y \in \bar Y$ and $(Q_r h)(y) = \int_Y h(y')\,f_Y(y')\,\mu_y(dy')$ when $y \in N$. This is the
operator corresponding to a Markov chain on $Y$ that evolves as follows. If the
current state, $y$, is in $\bar Y$, then the distribution of the next state is that of
$g'g^{-1}y$ where $(g,g')$ is a random element from the density
$f_Y(g'g^{-1}y)\,j(g', g^{-1}y) / m_r(g^{-1}y)$, and if $y \in N$, then the next state is from $f_Y$. Denote the
corresponding Mtf on $Y$ as $Q_r(y,dy')$. We now establish that $f_Y$ is an invariant
density for $Q_r$ by showing that $Q_r$ is reversible with respect to $f_Y$.

Proposition 2. Suppose $r$ is a probability measure on $G$ such that
$m_r(y) > 0$ for all $y \in Y$ and such that $gy \in \bar Y$ for all $y \in \bar Y$ and all $g \in G$.
Then the Mtf $Q_r$ is reversible with respect to $f_Y$.

Proof. We prove the result in the case where $\mu_y$ is relatively invariant
and leave the extension to the general case to the reader. Let $h_1, h_2 \in L^2_0(f_Y)$
be bounded. We will show that $\langle Q_r h_1, h_2 \rangle = \langle h_1, Q_r h_2 \rangle$. Indeed,

(9) $\langle h_1, Q_r h_2 \rangle = \int_G \int_G \int_Y \frac{h_1(y)\,h_2(g'g^{-1}y)\,f_Y(y)\,f_Y(g'g^{-1}y)\,\chi(g')}{m_r(g^{-1}y)}\,\mu_y(dy)\,r(dg)\,r(dg')$.

Now, since $g g'^{-1} g' g^{-1} = e$, the inner integral in (9) can be expressed as

$\int_Y \frac{h_1(g g'^{-1} g' g^{-1} y)\,h_2(g'g^{-1}y)\,f_Y(g g'^{-1} g' g^{-1} y)\,f_Y(g'g^{-1}y)\,\chi(g'g^{-1})\,\chi(g)}{m_r(g'^{-1} g' g^{-1} y)}\,\mu_y(dy)$,

which, using the relative invariance of $\mu_y$, becomes

$\int_Y \frac{h_1(g g'^{-1} y)\,h_2(y)\,f_Y(g g'^{-1} y)\,f_Y(y)\,\chi(g)}{m_r(g'^{-1} y)}\,\mu_y(dy)$.

Thus, (9) can be written as

$\int_G \int_G \int_Y \frac{h_1(g g'^{-1} y)\,h_2(y)\,f_Y(g g'^{-1} y)\,f_Y(y)\,\chi(g)}{m_r(g'^{-1} y)}\,\mu_y(dy)\,r(dg)\,r(dg') = \langle Q_r h_1, h_2 \rangle$. □

Example 1. Let $Y = \mathbb{R}$ and take $\mu_y$ to be Lebesgue measure. Let $f_Y(y) = \tfrac{1}{2} e^{-|y|}$ and take $G$ to be the multiplicative group on $\mathbb{R}_+$. If the group action is defined as multiplication, then $\mu_y$ is relatively invariant with multiplier $\chi(g) = g$. [We always use $\chi(g)$ instead of $j(g,y)$ when $\mu_y$ is relatively invariant.] If we take $r(dg)$ to be a probability measure with density $e^{-g}$ on the positive half-line, then $m_r(y) = \tfrac{1}{2}(1 + |y|)^{-2} \in (0, \infty)$ for all $y \in Y$. [For an example where $m_r$ is not finite everywhere, use $(1 + g)^{-2}$ in place of $e^{-g}$.] A simple calculation shows that the distribution of the random element $(g, g')$ used to make the transitions under $Q_r$ can be described as follows. First, $g \sim \mathrm{Exp}(1)$ and, conditional on $g$, $g'$ has density (with respect to Lebesgue measure on $\mathbb{R}_+$) given by

\[
\frac{f_Y(g' g^{-1} y)\, \chi(g')\, e^{-g'}}{m_r(g^{-1} y)}.
\]

Hence, $g' \mid g \sim \mathrm{Gamma}(2,\, 1 + |y|/g)$, where the second parameter is a rate, and it follows that $g' g^{-1} = v$, where $v$ is a random variable on $\mathbb{R}_+$ with density given by

\[
f(v) = v \exp\{-v|y|\} \left[ \frac{2}{(v+1)^3} + \frac{2|y|}{(v+1)^2} + \frac{y^2}{v+1} \right].
\]

Consequently, for $y \neq 0$ and measurable $A \subset \mathbb{R}$, $Q_r(y, A) = \int_A q_r(y'|y)\, \mu_y(dy')$ where

\[
q_r(y'|y) = e^{-|y'|}\, \frac{|y'|\,|y|}{|y'+y|} \left[ \frac{2}{|y'+y|^2} + \frac{2}{|y'+y|} + 1 \right] \left[ I_{\mathbb{R}_+}(y)\, I_{\mathbb{R}_+}(y') + I_{\mathbb{R}_-}(y)\, I_{\mathbb{R}_-}(y') \right].
\]

Clearly, $q_r(y'|y)\, f_Y(y)$ is a symmetric function of $(y', y)$, so the Mtf $Q_r$ is reversible with respect to $f_Y$, as it must be according to Proposition 2. Note that the chain is not irreducible. For example, if it is started with $y_0 > 0$, then it will never visit the negative half-line.
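The two-stage description of $Q_r$ in Example 1 can be tried out numerically. The following is a minimal sketch, not part of the original development: it assumes the $\mathrm{Exp}(1)$ working prior of the example, treats the second Gamma parameter as a rate, and uses the hypothetical helper names `density_v`, `integrate_density_v` and `q_r_step`. It checks by quadrature that the density of $v$ integrates to one, and simulates a path started at a positive value.

```python
import math
import random


def density_v(v, y):
    """Density of v = g'/g in Example 1, for a fixed current state y != 0."""
    a = abs(y)
    bracket = 2.0 / (v + 1.0) ** 3 + 2.0 * a / (v + 1.0) ** 2 + a * a / (v + 1.0)
    return v * math.exp(-v * a) * bracket


def integrate_density_v(y, upper=80.0, steps=200_000):
    """Midpoint-rule approximation of the integral of density_v over (0, upper)."""
    h = upper / steps
    return h * sum(density_v((k + 0.5) * h, y) for k in range(steps))


def q_r_step(y, rng):
    """One transition of the Q_r chain of Example 1 from a state y != 0."""
    g = rng.expovariate(1.0)
    while g == 0.0:  # guard against a (vanishingly unlikely) zero draw
        g = rng.expovariate(1.0)
    # Gamma(2, rate) is sampled with shape 2 and scale 1 / rate.
    rate = 1.0 + abs(y) / g
    g_prime = rng.gammavariate(2.0, 1.0 / rate)
    return g_prime * y / g  # next state is g' g^{-1} y


rng = random.Random(0)
path = [2.0]
for _ in range(1000):
    path.append(q_r_step(path[-1], rng))

total_mass = integrate_density_v(1.5)
```

Because $g$ and $g'$ are both positive, every state in `path` has the sign of the starting value, which is the reducibility noted in the example.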



4.3. A transition based on left-Haar measure on $G$. In this section, we build on results in Liu and Sabatti [10] to construct an Mtf, $Q$, that is a generalized version of Step 2 of the Haar PX-DA algorithm. We begin by describing left-Haar measure and some of its properties. Under the assumptions of Section 4.1 there exists a left-Haar measure, $\nu_l$, on $G$, which is a nontrivial measure satisfying

\[
\int_G h(\tilde{g} g)\, \nu_l(dg) = \int_G h(g)\, \nu_l(dg)
\tag{10}
\]

for all $\tilde{g} \in G$ and all integrable functions $h : G \to \mathbb{R}$. This measure is unique up to a multiplicative constant. Moreover, there exists a multiplier, $\Delta$, called the (right) modular function of the group, with the property that $\nu_r(dg) := \Delta(g^{-1})\, \nu_l(dg)$ is a right-Haar measure, which satisfies the obvious analogue of (10). Groups for which $\Delta(g) \equiv 1$, that is, for which right- and left-Haar measure are equivalent, are called unimodular. We now state two useful formulas that will be used repeatedly in the sequel (see Fremlin [4], Theorem 442K). If $\tilde{g} \in G$ and $h : G \to \mathbb{R}$ is an integrable function, then

\[
\int_G h(g \tilde{g}^{-1})\, \nu_l(dg) = \Delta(\tilde{g}) \int_G h(g)\, \nu_l(dg)
\tag{11}
\]

and

\[
\int_G h(g^{-1})\, \nu_l(dg) = \int_G h(g)\, \Delta(g^{-1})\, \nu_l(dg).
\tag{12}
\]
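As a concrete illustration of (10), consider the multiplicative group on the positive half-line, for which $\nu_l(dg) = dg/g$. The sketch below is only a numerical sanity check under stated assumptions: the integrand `h` is an arbitrary illustrative choice, and `haar_integral` is a hypothetical helper that substitutes $g = e^t$ so that $dg/g$ becomes $dt$ and left-invariance becomes translation invariance in $t$.

```python
import math


def haar_integral(h, lo=-30.0, hi=30.0, steps=60_000):
    """Approximate the integral of h(g) dg / g over the positive half-line.

    With g = exp(t), the measure dg / g becomes dt, so the trapezoid rule
    is applied on the t scale over [lo, hi].
    """
    dt = (hi - lo) / steps
    total = 0.5 * (h(math.exp(lo)) + h(math.exp(hi)))
    total += sum(h(math.exp(lo + k * dt)) for k in range(1, steps))
    return dt * total


def h(g):
    # Illustrative integrand (an assumption); it is integrable against dg / g.
    return g * math.exp(-g)


shift = 3.7  # the fixed group element on the left side of (10)
lhs = haar_integral(lambda g: h(shift * g))  # integral of h(shift * g) dg / g
rhs = haar_integral(h)                       # integral of h(g) dg / g
```

Both quantities agree (and equal one for this particular `h`). Since this group is commutative, it is unimodular, so (11) and (12) hold here with $\Delta \equiv 1$.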

Now assume that $m(y) := \int_G f_Y(gy)\, j(g,y)\, \nu_l(dg)$ is positive for all $y \in Y$ and finite for $\mu_y$-almost all $y \in Y$. As in Section 4.2, let $N$ denote the $\mu_y$-null set of $y$ values for which $m(y) = \infty$ and set $\bar{Y} = Y \setminus N$. A routine calculation shows that, for $y \in \bar{Y}$,

\[
m(gy) = j(g^{-1}, y)\, \Delta(g^{-1})\, m(y).
\tag{13}
\]

This formula is basically equation (A1) from [10]. One consequence of (13) is that $gy \in \bar{Y}$ for all $y \in \bar{Y}$ and all $g \in G$. Let $Q$ be an operator on $L^2_0(f_Y)$ defined by

\[
(Qh)(y) = \int_G h(gy)\, \frac{f_Y(gy)\, j(g,y)}{m(y)}\, \nu_l(dg)
\]

when $y \in \bar{Y}$ and $(Qh)(y) = \int_Y h(y')\, f_Y(y')\, \mu_y(dy')$ when $y \in N$. This is the operator associated with the Markov chain on $Y$ that evolves as follows. If the current state, $y$, is in $\bar{Y}$, then the distribution of the next state is that of $gy$ where $g$ is a random element from $G$ whose density (with respect to $\nu_l$) is $f_Y(gy)\, j(g,y) / m(y)$, and if $y \in N$, then the next state is drawn from $f_Y$. Denote the chain and its Mtf by $\Psi = \{\Psi_n\}_{n=0}^{\infty}$ and $Q(y, dy')$.



Proposition 3. Suppose that $m(y)$ is positive for all $y \in Y$ and finite for $\mu_y$-almost all $y \in Y$ so that $Q$ is well-defined. Then the Mtf $Q$ is reversible with respect to $f_Y$.



Proof. As in the proof of Proposition 2, let $h_1, h_2 \in L^2_0(f_Y)$ be bounded. Then

\[
\begin{aligned}
(h_1, Q h_2)
&= \int_G \int_Y \frac{h_1(y)\, h_2(gy)\, f_Y(y)\, f_Y(gy)\, j(g,y)}{m(y)}\, \mu_y(dy)\, \nu_l(dg) \\
&= \int_G \left[ \int_Y \frac{h_1(g^{-1}y)\, h_2(y)\, f_Y(g^{-1}y)\, f_Y(y)}{m(g^{-1}y)}\, \mu_y(dy) \right] \nu_l(dg) \\
&= \int_Y \frac{h_2(y)\, f_Y(y)}{m(y)} \left[ \int_G h_1(g^{-1}y)\, f_Y(g^{-1}y)\, j(g^{-1}, y)\, \Delta(g^{-1})\, \nu_l(dg) \right] \mu_y(dy) \\
&= \int_Y \frac{h_2(y)\, f_Y(y)}{m(y)} \left[ \int_G h_1(gy)\, f_Y(gy)\, j(g,y)\, \nu_l(dg) \right] \mu_y(dy) \\
&= (Q h_1, h_2),
\end{aligned}
\]

where the second through fourth equalities are due to, respectively, (8), (13) and (12). □

Compared with Theorem 1 in [10], our Proposition 3 is more general and 
has a stronger conclusion (reversibility versus invariance). 

Example 1 (continued). As noted previously, the multiplicative group is unimodular and $\nu_l(dg) = dg/g$ where $dg$ denotes Lebesgue measure on $\mathbb{R}_+$. Now, $m(y) = \int_G f_Y(gy)\, \chi(g)\, \nu_l(dg) = [2|y|]^{-1}$. Therefore, $N = \{0\}$ and $\bar{Y}$ is the real line less the origin. For $y \neq 0$, $g \sim \mathrm{Exp}(|y|)$ and for measurable $A \subset \bar{Y}$, $Q(y, A) = \int_A q(y'|y)\, \mu_y(dy')$ where

\[
q(y'|y) = e^{-|y'|} \left[ I_{\mathbb{R}_+}(y)\, I_{\mathbb{R}_+}(y') + I_{\mathbb{R}_-}(y)\, I_{\mathbb{R}_-}(y') \right].
\]

Again, the chain is not irreducible. However, for any fixed starting value in $\bar{Y}$, the random variables $\Psi_1, \Psi_2, \Psi_3, \ldots$ are independent and identically distributed (i.i.d.). Indeed, if $\psi_0 > 0$, then $\Psi_1, \Psi_2, \Psi_3, \ldots$ are i.i.d. $\mathrm{Exp}(1)$, and if $\psi_0 < 0$, then $\Psi_1, \Psi_2, \Psi_3, \ldots$ are i.i.d. with common distribution equal to that of $-Z$ where $Z \sim \mathrm{Exp}(1)$.
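The claims of the continued example are easy to probe by simulation. The sketch below is illustrative only; the helper names `m` and `haar_q_step` and the chosen starting values are assumptions, and $\mathrm{Exp}(|y|)$ is read as an exponential distribution with rate $|y|$. It checks that (13) reduces to $m(gy) = m(y)/g$ for this group, and that one-step draws from very different positive starting values have the same ($\mathrm{Exp}(1)$) mean.

```python
import random


def m(y):
    """m(y) = 1 / (2|y|) for Example 1; finite whenever y != 0."""
    return 1.0 / (2.0 * abs(y))


def haar_q_step(y, rng):
    """One transition of Q from y != 0: draw g ~ Exp(rate |y|), move to g * y."""
    g = rng.expovariate(abs(y))
    return g * y


rng = random.Random(1)
n = 20_000

# One-step draws from two very different positive starting values.
from_small = [haar_q_step(0.01, rng) for _ in range(n)]
from_large = [haar_q_step(50.0, rng) for _ in range(n)]
mean_small = sum(from_small) / n
mean_large = sum(from_large) / n

# A negative starting value keeps the chain on the negative half-line.
from_negative = [haar_q_step(-3.0, rng) for _ in range(100)]

# Formula (13) with chi(g) = g and Delta = 1: m(g y) = m(y) / g.
g, y = 2.0, 1.5
lhs_13, rhs_13 = m(g * y), m(y) / g
```

The distribution of the next state depends on the current state only through its sign, which is the mechanism behind the i.i.d. behavior described above.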

The behavior exhibited by $\Psi$ in the example above is not exceptional. Indeed, $Q$ has the special property that, conditional on any fixed starting value in $\bar{Y}$, $\{\Psi_n\}_{n=1}^{\infty}$ is an i.i.d. sequence (which must be from $f_Y$ if the chain satisfies the usual regularity conditions). We will not prove this result here (due to space limitations), but we will prove that $Q$ is idempotent. For $n \in \mathbb{N} := \{1, 2, 3, \ldots\}$, let $Q^n(y, dy')$ denote the $n$-step Mtf.

Proposition 4. Suppose that $m(y)$ is positive for all $y \in Y$ and finite for $\mu_y$-almost all $y \in Y$ so that $Q$ is well-defined. For each $y \in Y$, $Q^2(y, dy') = Q(y, dy')$ and hence $Q^n(y, dy') = Q(y, dy')$ for all $n \in \mathbb{N}$.

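Proof. We prove the result in the case where $N = \varnothing$ and leave the extension to the general case to the reader. We will show that for $h \in L^2_0(f_Y)$, $(Q^2 h)(y) = (Q(Qh))(y) = (Qh)(y)$ for all $y \in Y$. Indeed,

\[
\begin{aligned}
(Q(Qh))(y)
&= \int_G \left[ \int_G \frac{h(g'gy)\, f_Y(g'gy)\, j(g', gy)}{m(gy)}\, \nu_l(dg') \right] \frac{f_Y(gy)\, j(g,y)}{m(y)}\, \nu_l(dg) \\
&= \int_G \frac{f_Y(gy)}{m(y)\, m(gy)} \left[ \int_G h(g'gy)\, f_Y(g'gy)\, j(g'g, y)\, \nu_l(dg') \right] \nu_l(dg) \\
&= \int_G \frac{f_Y(gy)}{m(y)\, m(gy)} \left[ \int_G \Delta(g^{-1})\, h(g'y)\, f_Y(g'y)\, j(g', y)\, \nu_l(dg') \right] \nu_l(dg) \\
&= \int_G \int_G \frac{j(g,y)\, h(g'y)\, f_Y(gy)\, f_Y(g'y)\, j(g', y)}{m(y)\, m(y)}\, \nu_l(dg')\, \nu_l(dg) \\
&= \int_G \frac{h(g'y)\, f_Y(g'y)\, j(g', y)}{m(y)\, m(y)} \left[ \int_G f_Y(gy)\, j(g,y)\, \nu_l(dg) \right] \nu_l(dg') \\
&= (Qh)(y),
\end{aligned}
\]

where the third and fourth equalities are due to, respectively, (11) and (13). □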

The discussion preceding the statement of Proposition 4 suggests that it might be possible to use $Q$ to make i.i.d. draws from $f_Y$. Unfortunately, as we now explain, it is typically impossible to simulate $Q$ when the corresponding Markov chain is irreducible. Fix $y \in Y$ and define

\[
O_y = \{ y' \in Y : y' = gy \text{ for some } g \in G \}.
\]

The set $O_y$ is called the orbit of $y$. The orbits induce an equivalence relation on the space $Y$; that is, two points are equivalent if they are in the same orbit. Hence, $Y$ can be partitioned into a collection of orbits. Clearly, when the Markov chain driven by $Q$ (or $Q_r$ for that matter) is started at the fixed value $y \in Y$, it remains forever in $O_y$. Therefore, if the probability measure associated with $f_Y$ puts positive mass on the complement of $O_y$, the Markov chain will not be $f_Y$-irreducible. Of course, the complement of $O_y$ definitely has measure zero when $O_y = Y$; that is, when there is only one orbit. Unfortunately, when $Y$ and $G$ are Euclidean spaces, the situations where there is only one orbit are those in which $g$ and $y$ have the same dimension. In practice, sampling from $f_Y(y)$ is not feasible and hence, if $g$ and $y$ share the same dimension, making draws from a density (in $g$) that is proportional to $f_Y(gy)\, j(g,y)$ will also likely be impossible. Loosely speaking, we are able to simulate $Q$ only when the corresponding Markov chain is reducible. While such reducible chains are not particularly useful by themselves, they can be used as part of a hybrid chain that is irreducible (see, e.g., Liu and Sabatti [10]) and they can be used to improve other chains such as the DA algorithm.

5. General versions of PX-DA and Haar PX-DA. Our general PX-DA algorithm has Mtd given by

\[
p_r(x|x') = \int_Y \int_Y f_{X|Y}(x|y')\, Q_r(y, dy')\, f_{Y|X}(y|x')\, \mu_y(dy).
\]

We now prove that $p_r$ is better than $p$ defined at (5) in both the efficiency ordering and the operator norm sense. We accomplish this by showing that $p_r$ is a DA algorithm. Let $P$ and $P_r$ denote the operators corresponding to $p$ and $p_r$.
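The form of the general PX-DA Mtd corresponds to a three-stage update: draw $y$ from $f_{Y|X}(\cdot|x')$, move $y$ to $y'$ according to the Mtf on $Y$, then draw $x$ from $f_{X|Y}(\cdot|y')$. The following sketch shows only this composition; the function names and the three sampler arguments are hypothetical placeholders that a user would supply for a specific model, and nothing here is taken from a particular implementation.

```python
from typing import Callable, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def px_da_step(
    x_prev: X,
    draw_y_given_x: Callable[[X], Y],
    move_y: Callable[[Y], Y],
    draw_x_given_y: Callable[[Y], X],
) -> X:
    """One transition of a chain with the three-stage structure of p_r.

    All three callables are user-supplied placeholders: draw_y_given_x
    samples from f_{Y|X}(. | x'), move_y samples from Q_r(y, .) (or from
    Q for the Haar version), and draw_x_given_y samples from f_{X|Y}(. | y').
    """
    y = draw_y_given_x(x_prev)      # Step 1: y ~ f_{Y|X}(. | x')
    y_moved = move_y(y)             # Step 2: y' ~ Q_r(y, .)
    return draw_x_given_y(y_moved)  # Step 3: x ~ f_{X|Y}(. | y')


def da_step(x_prev, draw_y_given_x, draw_x_given_y):
    """Plain DA: the same update with Step 2 leaving y unchanged."""
    return px_da_step(x_prev, draw_y_given_x, lambda y: y, draw_x_given_y)
```

With deterministic stand-ins, `px_da_step(1, lambda x: x + 1, lambda y: y * 10, lambda y: y - 3)` composes the three maps in order, which makes the structure easy to unit-test before plugging in real samplers.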



Proposition 5. Let $r$ be a probability measure on $G$ such that $Q_r$ is well-defined. Then the Mtd $p_r$ is a DA algorithm. Thus, if $p$ and $p_r$ satisfy the usual regularity conditions, then $p_r \succeq_E p$ and $\|P_r\| \le \|P\|$.

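Proof. Define $\tilde{f}(x,y) = \int_G f(x, gy)\, j(g,y)\, r(dg)$ and note that $\int_Y \tilde{f}(x,y)\, \mu_y(dy) = f_X(x)$. Hence, $\tilde{f}$ is a joint density on $X \times Y$ (with respect to $\mu_x \times \mu_y$) whose $x$ marginal is $f_X$. For $y \in \bar{Y}$,

\[
\tilde{f}_{X|Y}(x|y) = \frac{\tilde{f}(x,y)}{\int_X \tilde{f}(x,y)\, \mu_x(dx)} = \frac{\int_G f(x, gy)\, j(g,y)\, r(dg)}{m_r(y)},
\]

where, as in Section 4.2, $m_r(y) = \int_G f_Y(gy)\, j(g,y)\, r(dg)$. Also,

\[
\tilde{f}_{Y|X}(y|x) = \frac{\tilde{f}(x,y)}{\int_Y \tilde{f}(x,y)\, \mu_y(dy)} = \frac{\int_G f(x, gy)\, j(g,y)\, r(dg)}{f_X(x)} = \int_G f_{Y|X}(gy|x)\, j(g,y)\, r(dg).
\]

Now,

\[
\begin{aligned}
\int_Y \tilde{f}_{X|Y}(x|y)\, \tilde{f}_{Y|X}(y|x')\, \mu_y(dy)
&= \int_Y \left[ \frac{1}{m_r(y)} \int_G f(x, g'y)\, j(g', y)\, r(dg') \right] \left[ \int_G f_{Y|X}(gy|x')\, j(g,y)\, r(dg) \right] \mu_y(dy) \\
&= \int_G \int_G \left[ \int_Y \frac{f(x, g'g^{-1}y)\, j(g', g^{-1}y)\, f_{Y|X}(y|x')}{m_r(g^{-1}y)}\, \mu_y(dy) \right] r(dg')\, r(dg) \\
&= \int_Y \left[ \int_G \int_G \frac{f_{X|Y}(x|g'g^{-1}y)\, f_Y(g'g^{-1}y)\, j(g', g^{-1}y)}{m_r(g^{-1}y)}\, r(dg')\, r(dg) \right] f_{Y|X}(y|x')\, \mu_y(dy) \\
&= \int_Y \left[ \int_Y f_{X|Y}(x|y')\, Q_r(y, dy') \right] f_{Y|X}(y|x')\, \mu_y(dy) = p_r(x|x'),
\end{aligned}
\]

where the second equality is due to (8) and the penultimate equality follows from the definition of $Q_r$. We conclude that $p_r$ is a DA algorithm. An appeal to Corollary 1 yields the result. □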



In Proposition 5, the efficiency ordering result is new, but a special case of the operator norm result (where $X$, $Y$ and $G$ are Euclidean spaces) is known; see L&W's Theorem 2.

Our general Haar PX-DA algorithm has Mtd given by

\[
p^*(x|x') = \int_Y \int_Y f_{X|Y}(x|y')\, Q(y, dy')\, f_{Y|X}(y|x')\, \mu_y(dy),
\]

where $Q(y, dy')$ is the Mtf defined in Section 4.3. Let $P^*$ denote the corresponding operator. Our next result establishes that the Haar PX-DA algorithm is better than every PX-DA algorithm in both the efficiency ordering and the operator norm sense. Before we state and prove the result, we explain the main idea. The most direct route to a proof would be to show that $Q \succeq_1 Q_r$ for every $r(dg)$, and then apply Theorem 3. However, we have not been able to establish that $Q \succeq_1 Q_r$. Alternatively, the reason we found success in comparing $p_r$ and $p$ is that $p_r$ is an improvement of the DA algorithm. At first glance, there is no such connection between $p^*$ and $p_r$. However, Proposition 5 says that $p_r$ is a DA algorithm and it turns out that $p^*$ can be represented as an improvement of $p_r$.

Theorem 4. Let $r$ be any probability measure on $G$ such that $Q_r$ is well-defined. Suppose that $m(y)$ is positive for all $y \in Y$ and finite for $\mu_y$-almost all $y \in Y$ so that $Q$ is well-defined. If $p_r$ and $p^*$ satisfy the usual regularity conditions, then $p^* \succeq_E p_r$ and $\|P^*\| \le \|P_r\|$.

Proof. We prove the result in the case where $N = \varnothing$ (for both $m_r$ and $m$) and leave the extension to the general case to the reader. We know from Proposition 5 that $p_r$ is a DA algorithm with respect to the joint density $\tilde{f}(x,y) = \int_G f(x, gy)\, j(g,y)\, r(dg)$ and that $\int_X \tilde{f}(x,y)\, \mu_x(dx) = m_r(y)$. Let $\tilde{Q}$ be the Mtf on $Y$ with invariant density $m_r(y)$ that is constructed according to the recipe in Section 4.3; that is, $\tilde{Q}$ is what we would have ended up with had we used $m_r(y)$ in place of $f_Y(y)$ in Section 4.3. We will show that

\[
p^*(x|x') = \int_Y \int_Y \tilde{f}_{X|Y}(x|y')\, \tilde{Q}(y, dy')\, \tilde{f}_{Y|X}(y|x')\, \mu_y(dy);
\tag{14}
\]

that is, $p^*$ is an improvement of $p_r$. First, if we substitute $m_r(y)$ for $f_Y(y)$ in the definition of $m(y)$, we have

\[
\begin{aligned}
\int_G m_r(gy)\, j(g,y)\, \nu_l(dg)
&= \int_G \left[ \int_G f_Y(g'gy)\, j(g', gy)\, r(dg') \right] j(g,y)\, \nu_l(dg) \\
&= \int_G \left[ \int_G f_Y(g'gy)\, j(g'g, y)\, \nu_l(dg) \right] r(dg') \\
&= \int_G \left[ \int_G f_Y(gy)\, j(g,y)\, \nu_l(dg) \right] r(dg') = m(y).
\end{aligned}
\]

Hence, the function $m(y)$ is the same whether we use $f_Y$ or $m_r$. Now, using the definition of $\tilde{Q}$ and the calculation above, we have

\[
\int_Y \tilde{f}_{X|Y}(x|y')\, \tilde{Q}(y, dy') = \frac{1}{m(y)} \int_G \tilde{f}_{X|Y}(x|g''y)\, m_r(g''y)\, j(g'', y)\, \nu_l(dg'').
\]

Thus,

\[
\begin{aligned}
&\int_Y \left[ \int_Y \tilde{f}_{X|Y}(x|y')\, \tilde{Q}(y, dy') \right] \tilde{f}_{Y|X}(y|x')\, \mu_y(dy) \\
&\quad= \int_Y \left[ \int_G \frac{\int_G f(x, g'g''y)\, j(g', g''y)\, r(dg')}{m_r(g''y)}\, \frac{m_r(g''y)\, j(g'', y)}{m(y)}\, \nu_l(dg'') \right] \left[ \int_G f_{Y|X}(gy|x')\, j(g,y)\, r(dg) \right] \mu_y(dy) \\
&\quad= \int_G \int_G \int_G \left[ \int_Y \frac{f(x, g'g''y)\, j(g'g'', g^{-1}gy)\, j(g,y)\, f_{Y|X}(gy|x')}{m(y)}\, \mu_y(dy) \right] \nu_l(dg'')\, r(dg')\, r(dg) \\
&\quad= \int_G \int_G \int_G \left[ \int_Y \frac{f(x, g'g''g^{-1}y)\, j(g'g'', g^{-1}y)\, f_{Y|X}(y|x')}{m(g^{-1}y)}\, \mu_y(dy) \right] \nu_l(dg'')\, r(dg')\, r(dg) \\
&\quad= \int_G \int_G \int_Y \left[ \int_G \frac{f(x, g'g''g^{-1}y)\, j(g'g''g^{-1}, y)\, \Delta(g^{-1})\, f_{Y|X}(y|x')}{m(y)}\, \nu_l(dg'') \right] \mu_y(dy)\, r(dg')\, r(dg) \\
&\quad= \int_G \int_G \int_Y \left[ \int_G \frac{f(x, g'g''y)\, j(g'g'', y)\, f_{Y|X}(y|x')}{m(y)}\, \nu_l(dg'') \right] \mu_y(dy)\, r(dg')\, r(dg) \\
&\quad= \int_G \int_G \int_Y \left[ \int_G \frac{f(x, g''y)\, j(g'', y)\, f_{Y|X}(y|x')}{m(y)}\, \nu_l(dg'') \right] \mu_y(dy)\, r(dg')\, r(dg) \\
&\quad= \int_Y \left[ \int_G \frac{f(x, g''y)\, j(g'', y)\, f_{Y|X}(y|x')}{m(y)}\, \nu_l(dg'') \right] \mu_y(dy) \\
&\quad= \int_Y \left[ \int_G \frac{f_{X|Y}(x|g''y)\, f_Y(g''y)\, j(g'', y)}{m(y)}\, \nu_l(dg'') \right] f_{Y|X}(y|x')\, \mu_y(dy) \\
&\quad= \int_Y \left[ \int_Y f_{X|Y}(x|y')\, Q(y, dy') \right] f_{Y|X}(y|x')\, \mu_y(dy) = p^*(x|x'),
\end{aligned}
\]

where the second equality follows from the properties of $j$, the third is from (8), the fourth is due to Fubini and (13), the fifth is a consequence of (11), the sixth is due to the left-invariance of $\nu_l$, the seventh follows from the fact that $r$ is a probability measure, and the penultimate equality is due to the definition of $Q$. Proposition 4 implies that $\tilde{Q}(y, dy')$ is idempotent and it follows from Proposition 1 that $p^*(x|x')$ is a DA algorithm. An application of Corollary 1 yields the result. □

L&W proved that $\|P^*\| \le \|P_r\|$ in the special case where $X$, $Y$ and $G$ are Euclidean spaces and $G$ is a unimodular group. Their proof relies heavily on a further assumption regarding the group structure that we now describe. Recall that $Y$ can be partitioned into a set of orbits. A cross section is basically a subset of $Y$ that intersects each orbit exactly once (see, e.g., Wijsman [24]). L&W assume the existence of a cross section and a corresponding diffeomorphism that allows one to express each point in $Y$ in terms of two quantities: its orbit and its position within its orbit. As L&W point out, the existence of a cross section and an associated diffeomorphism is not guaranteed in general.

Recall from the discussion in Section 1 that L&W and M&vD developed (what appear to be) different strategies for handling the case in which $r$ is improper. We now demonstrate that the general Haar PX-DA Markov chain can be viewed as a marginal Markov chain associated with a nonpositive recurrent chain on a larger space. This result implies that, when the group structure is present, M&vD's chain (with left-Haar measure for the working prior) is exactly the same as L&W's Haar PX-DA algorithm. Suppose, as in most of the interesting applications, that $\nu_l(G) = \infty$. Following the ideas in M&vD, consider the function mapping $X \times Y \times G$ into $[0, \infty)$ that is defined by $f^*(x,y,g) = f(x, gy)\, j(g,y)$. Now since

\[
\int_G \int_Y \int_X f^*(x,y,g)\, \mu_x(dx)\, \mu_y(dy)\, \nu_l(dg) = \nu_l(G),
\]

$f^*(x,y,g)$ is not integrable and therefore cannot be normalized to be a probability density function with respect to $\mu_x \times \mu_y \times \nu_l$. On the other hand, we can formally define "conditional" densities based on $f^*$ as follows:

\[
f^*(y|x,g) = \frac{f^*(x,y,g)}{\int_Y f^*(x,y,g)\, \mu_y(dy)} = \frac{f(x, gy)\, j(g,y)}{f_X(x)} = f_{Y|X}(gy|x)\, j(g,y),
\]

and, for $y \in \bar{Y}$,

\[
f^*(x,g|y) = \frac{f^*(x,y,g)}{\int_G \int_X f^*(x,y,g)\, \mu_x(dx)\, \nu_l(dg)} = \frac{f(x, gy)\, j(g,y)}{m(y)}.
\]

Therefore, despite the fact that $f^*$ is not a density,

\[
\tilde{p}^*((x,g)|(x',g')) = \int_Y f^*(x,g|y)\, f^*(y|x',g')\, \mu_y(dy)
\]

is still a "DA-type" Mtd on $X \times G$. A routine calculation reveals that $f_X(x)\, \mu_x(dx)\, \nu_l(dg)$ is an invariant measure for the corresponding Markov chain, which we denote by $\{(X_n, G_n)\}_{n=0}^{\infty}$. However, $\int_G \int_X f_X(x)\, \mu_x(dx)\, \nu_l(dg) = \nu_l(G)$, and hence the chain cannot be positive recurrent (Hobert [6]). On the other hand, the density of $X_{n+1}$ given $(X_n, G_n) = (x', g')$ is

\[
\begin{aligned}
\int_G \tilde{p}^*((x,g)|(x',g'))\, \nu_l(dg)
&= \int_G \left[ \int_Y \frac{f(x, gy)\, j(g,y)}{m(y)}\, f_{Y|X}(g'y|x')\, j(g', y)\, \mu_y(dy) \right] \nu_l(dg) \\
&= \int_G \left[ \int_Y \frac{f(x, gg'^{-1}y)\, j(g, g'^{-1}y)}{m(g'^{-1}y)}\, f_{Y|X}(y|x')\, \mu_y(dy) \right] \nu_l(dg) \\
&= \int_Y \left[ \int_G \frac{f(x, gg'^{-1}y)\, j(g, g'^{-1}y)}{m(y)\, j(g', y)\, \Delta(g')}\, f_{Y|X}(y|x')\, \nu_l(dg) \right] \mu_y(dy) \\
&= \int_Y \left[ \int_G \frac{f(x, gg'^{-1}y)\, j(gg'^{-1}, y)\, \Delta(g'^{-1})}{m(y)}\, f_{Y|X}(y|x')\, \nu_l(dg) \right] \mu_y(dy) \\
&= \int_Y \left[ \int_G \frac{f(x, gy)\, j(g,y)}{m(y)}\, f_{Y|X}(y|x')\, \nu_l(dg) \right] \mu_y(dy) = p^*(x|x'),
\end{aligned}
\]

where the second equality follows from (8), the third is due to Fubini and (13), the fourth is a consequence of the properties of $j$ and the fifth equality is due to (11). Since $\int_G \tilde{p}^*((x,g)|(x',g'))\, \nu_l(dg)$ does not depend on $g'$, it follows that $\{X_n\}_{n=0}^{\infty}$ itself is a Markov chain and the previous calculation shows that it is precisely the Markov chain driven by $p^*(x|x')$.